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Abstract 



Differential privacy has recently emerged as the de facto standard for private data release. This makes it possible 
to provide strong theoretical guarantees on the privacy and utility of released data. While it is well-understood how 
to release data based on counts and simple functions under this guarantee, it remains to provide general purpose 
techniques that are useful for a wider variety of queries. In this paper, we focus on spatial data, i.e., any multi- 
dimensional data that can be indexed by a tree structure. Directly applying existing differential privacy methods to 
this type of data simply generates noise. 

We propose instead the class of "private spatial decompositions": these adapt standard spatial indexing methods 
such as quadtrees and kd-trees to provide a private description of the data distribution. Equipping such structures with 
differential privacy requires several steps to ensure that they provide meaningful privacy guarantees. Various basic 
steps, such as choosing splitting points and describing the distribution of points within a region, must be done pri- 
vately, and the guarantees of the different building blocks must be composed into an overall guarantee. Consequently, 
we expose the design space for private spatial decompositions, and analyze some key examples. A major contribu- 
tion of our work is to provide new techniques for parameter setting and post-processing of the output to improve the 
accuracy of query answers. Our experimental study demonstrates that it is possible to build such decompositions 
efficiently, and use them to answer a variety of queries privately and with high accuracy. 

1 Introduction 

Releasing representative data sets which do not compromise the privacy of data subjects has occupied serious effort 
in the database community in recent years. The paradigm of differential privacy has recently emerged as the favored 
definition: it ensures that what can be learned from the released data does not substantially differ whether or not any 
given individual's data is included. This is intended to reassure data subjects that their participation in the process does 
not directly lead to information about them being revealed. The question that the data management community must 
now address is how to provide differential privacy guarantees, while ensuring that the result is not just private, but also 
useful. 

Consider a dataset containing locations of individuals at a particular time, e.g., GPS locations or home addresses. 
Such data could be used for many applications: transportation planning, facility location, political boundary drawing 
etc. More generally, any data set where attributes are ordered and have moderate to high cardinality (e.g., numerical 
attributes such as salary) can be considered spatial data: whenever data can be indexed by a tree structure (such as a 
B-tree, R-tree, kd-tree etc.), it is implicitly being treated as spatial. In spatial applications, a basic primitive is to know 
how many individuals fall within a given region (a multi-dimensional range query). Our aim is to release information 
which allows such queries to be answered accurately, while giving a strong guarantee of privacy on people's locations. 

Although this is a natural and important problem, there has been limited prior work that can be applied directly to 
this setting. The most straightforward method is to lay down a fine grid over the data, and add noise from a suitable 
distribution to the count of individuals within each cell [6J. For example, suppose we represent a set of 10^ GPS 
locations in a grid of 10 meter x 10 meter squares over the USA territory: this yields a total of approximately 10^^ 
entries, most of which are either or 1. Because the purpose of the noise is to mask whether there is an individual 
there, the output is simply a large mass of noisy counts, with little information remaining to accurately answer queries. 
Any query which touches, e.g., a 1% fraction of the area includes over 10^ noisy counts, which translates into a huge 
eiTor. 
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In this paper, we aim to balance the requirements of practicaHty and utility, while achieving a desired privacy 
guarantee. In particular, we design a class of differentially private spatial decompositions (PSDs). These partition 
the space into smaller regions, and report statistics on the points within each region. Queries can then be answered 
by intersecting the query region with the decomposition. By performing a spatial decomposition which results in 
compact regions with sufficiently many points and a more uniform distribution, we expect query answers to be more 
accurate. 

There are many existing (non-private) approaches to spatial decompositions. Some are data-independent, such as 
quadtrees which recursively divide the data space into equal quadrants. Other methods, such as the popular kd-trees, 
aim to better capture the data distribution and so are data-dependent. In this case, there are additional challenges: the 
description of the regions must also be differentially private. In other words, simply building a tree-structure over the 
exact data and populating it with noisy counts is not sufficient to meet the differential privacy definition: the choice of 
splits is based on the true data, and potentially reveals information. 

A second issue is how to answer queries given a PSD. In the non-private case, it is fairly straightforward to take a 
query region and intersect it with the data structure to obtain a count of contained nodes. In the private case, it is more 
complicated: because we have count estimates for large regions and subregions, there are multiple ways to decompose 
a query region, and the answers may vary due to differing noise. To achieve optimal query accuracy, we develop two 
new techniques that may be of independent interest: 

• First, we show that using non-uniform noise parameters can significantly improve accuracy, while maintaining 
the same privacy guarantee. Specifically, we propose setting the parameters in a geometric progression, in- 
creasing from root to leaves. To the best of our knowledge, this is the first result that analyzes the impact of 
non-uniform noise parameters. 

• Second, we design a new method to compute optimal (minimum variance) answers to queries by post-processing 
the noisy counts. Our method works over a large class of non-uniform noise parameters. It generalizes the 
method of ifTTI . which applied only to uniform noise parameters. 

Putting these pieces together, we show how to build private versions of well-known data structures, such as kd- 
trees, R-trees, and quadtrees, in multiple dimensions. This gives a full framework for privately representing spatial 
data. Our experimental study ranges over this framework, and allows us to see the impact of different choices on the 
accuracy of query answering over a mixture of real and synthetic data. We show that our two novel techniques can 
reduce the absolute error of queries by up to an order of magnitude. 

Outline. In Section[2]we describe related work, while Section[3]surveys concepts from differential privacy and spatial 
decompositions. We build our framework as follows: In Section]?] we show how to set non-uniform noise parameters 
in a hierarchical structure. Section]5]describes the post-processing method. In Section|6] we describe how to effectively 
find private median points under differential privacy, with further extensions in Section |7] Our experimental study in 
Section[8]considers different optimizations for different families of PSDs, and then compares these all together to find 
the best choices for working privately with spatial data. We also compare to two prior approaches Ii26i il2J and show 
that our methods outperform them. 

2 Related Work 

In this section, we present a brief summary of the most related topics in anonymization and privacy. For more details, 
there are several recent surveys and tutorials 121 SI El ID . 

Initial efforts to ensure privacy of released data were based on syntactic definitions such as ^-anonymity ||23l 
and ^-diversity |16|. Subsequent efforts tried to provide a more semantic guarantee, ensuring that no matter what 
knowledge or power an adversary had, their ability to draw conclusions about an individual in the data was limited. 
This culminated in the definition of differential privacy |6 | which ensures that the probability of any property holding 
on the output is approximately the same, whether or not an individual is present in the source data. 

Although often introduced in the context of an interaction between a data owner and a data user, differential privacy 
naturally adapts to a non-interactive setting where the data owner wishes to release private information derived from 
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their data. Conceptually, this is quite straightforward: the data owner determines what collection of statistics should 
be released, then computes these under some mechanism which provides an overall privacy guarantee. Initial work 
showed how to release contingency tables (equivalent to count cubes) in this 'publishing' model [1 1. Subsequent work 
in the database community also adopted the model of releasing tables of counts (histograms) and studied how to ensure 
these are accurate for different query workloads 1 15ll25i . 

There has been much work in producing mechanisms which offer differential privacy, such as the general purpose 
exponential mechanism lfT9l . and the geometric mechanism ifTOl . Approaches such as smoothed sensitivity have 
attempted to adapt the amount of noise needed to the instance of the data |20|. These mechanisms have been applied to 
various problems, such as releasing contingency tables [1], time series [22J, and recommender systems 1 18|. Recently, 
efforts have been made to improve the accuracy of (count) queries, by allowing ranges to be expressed via a smaller 
number of noisy wavelet or range coefficients fT5^,'25'|, and via post-processing with overlapping information 1 1 1 1. 

However, there has been limited work that applies to spatial data. The idea of differentially private data-partitioning 
index structures was previously suggested in the context of private record matching 1 1 21 . The approach there is based 
on using an approximate mean as a surrogate for median (on numerical data) to build kd-trees. This becomes a special 
case in our general framework. In Section 8.2 we show that there are much better methods to choose medians privately. 
We also compare to this case as part of our overall experimental study, and also for a record matching application in 
Section 113] 

The only prior work directly addressing spatial data follows the approach suggested above, and imposes a fixed 
resolution grid over the base data 1261 . It then builds a kd-tree based on noisy counts in the grid, splitting nodes 
which are not considered "uniform", and then populates the final leaves with "fresh" noisy estimated counts. In our 
experimental study (SectionlHl, we observe that this approach is inferior to other points in the framework. 



3 Preliminaries 
3.1 Differential Privacy 

We now formally introduce the concepts behind differential privacy. Let 2?i,2?2 be two neighboring datasets, i.e., 2?i 
and 2?2 differ in only one tuple f, written as ||2?i — 1'2|| = 1. In some cases, this is taken to mean that f has different 
values in the two datasets; in other cases, it is interpreted as meaning that f is present in only one of the two datasets. 
Either version leads to privacy guarantees which differ by at most a constant factor For concreteness, in this paper we 
consistently use the second definition. 

Definition 1 Let Vi , I?2 be two neighboring datasets. Let A denote a randomized algorithm over datasets, and S be 
an arbitrary set of possible outputs of A. Algorithm A is said to be £-dijferentially private if 

Pr[^CDi) eS]< e'Pi:[AiV2) G S]. 



Intuitively, differential privacy guarantees that no individual tuple can significantly affect the released information: 
the output distribution generated by A is nearly the same, whether or not that tuple is present in the dataset. The most 
common techirique for designing differentially -private algorithms was proposed in ||8|. It is a noise adding mechanism, 
as follows. 



Definition 2 (Laplace meclianism) Let f{T>) denote a numeric function over dataset T). An e-differentially private 
mechanism for releasing f is to publish C{'D) — /(I?) +X, where X is a random variable drawn from the Laplace 
distribution Lap((7(/) /e). 

The value CJ(/), called the sensitivity of /, is the maximum change in f when any single tuple ofD changes. 
Formally: 

a{f)^ max |/(2?i) -/p2)|. 

Pl.P2:||'Di-X'2|| = l 

For example, if / = count, then o{f) ~ 1 : for any two neighboring datasets 2?i and 2?2> the cardinalities of Vi 
and 2?2 differ by 1 . 
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Figure 1 : Example of private quadtree: noisy counts (inside boxes) are released; actual counts, although depicted, are 
not released. Query Q (dotted red rectangle) could be answered by adding noisy counts of marked nodes. 



3.2 Spatial Decompositions 

Spatial indexes are routinely used in database management, with a long history stretching over decades Il24l l5]l. Their 
main purpose is to allow queries to determine quickly whether or not a particular point is present in the data set. 
Clearly, we cannot extend this functionality to a privacy-preserving setting. However, spatial indexes also allow 
efficient computation for a rich set of aggregate queries, particularly range queries. We focus on these queries when 
studying the utility of PSDs. 

Formally, a spatial decomposition is a hierarchical (tree) decomposition of a geometric space into smaller areas, 
with data points partitioned among the leaves. Indexes are usually computed down to a level where the leaves either 
contain a small number of points, or have a small enough area, or a combination of the two. Unless otherwise specified, 
we assume that the tree is complete, i.e., all leaf-to-root paths have the same length, and all internal nodes have the 
same fanout. 

The following classification of spatial decompositions is widely used. We will describe our private spatial decom- 
positions following the same taxonomy. 

Data-independent decompositions. The best known example in this category is the quadtree in two dimensions, 
and its generalizations to higher dimensions (octree, etc.) Their defining feature is that their structure is pre-defined, 
and depends only on the domain of the attributes. E.g., in the quadtree, nodes are recursively divided into four equal 
regions via horizontal and vertical lines through the midpoint of each range El- 

Data-dependent decompositions. Here, the hierarchical partition of nodes is based on the input. Examples include: 

- Kd-trees: Nodes in a kd-tree are recursively split via lines passing through the median data value along some 
coordinate axis. The splitting axis changes at each level, in a cyclic order. 

- Hilbert R-trees: The R-tree is a spatial decomposition formed by nested (hyper)rectangles which may overlap. 
A Hilbert R-tree is computed as follows: Map all points to a Hilbert space-filling curve of sufficiently large order 
(see IIT3I for details). Build a binary tree on the Hilbert values, then map it back into the original space. The bounding 
boxes of data points corresponding to nodes in the tree form the Hilbert R-tree. 

Hybrid trees. Although less studied in the classical setting, a hybrid approach can be particularly useful in a privacy- 
aware model. In particular, we consider using data-dependent splits for the first £ levels of the tree (for some £ decided 



in advance), then switch to data-independent splitting (e.g., quadtree) for the remainder We observe in Section 8.2 
that queries over private hybrid trees are generally more accurate than with many other indexes. 



3.3 Building Private Spatial Decompositions 

We are now ready to describe how to build PSDs. The simplest PSD is formed by instantiating a data-independent tree 
such as a full quadtree, but computing counts for each node via the Laplace mechanism. See Figure [T]for an example. 

Recall that the concept of sensitivity (Definition |2]) captures the maximum change in the output caused by the 
presence or absence of a single tuple. For data-independent trees, releasing the structure of the index (i.e., the node 
rectangles) does not endanger the privacy of any individual. The only data-dependent features are the node counts: 
each node stores the number of data points that lie in the spatial cell associated with it. Note that adding or deleting a 
single tuple changes the counts of all the nodes on the path from the root to the leaf containing that tuple. Therefore, 
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to obtain a tree that satisfies e-differential privacy as a whole, we need to combine the privacy guarantees of individual 
node counts. We use the following well-known composition lemma: 

Lemma 1 [18] Let Ai,. . . ,At be t algorithms such that At satisfies Ei-dijferential privacy, 1 < i < t. Then their 
sequential composition {Ai , . . . , At) satisfies e-dijferential privacy, for £ = Y!i=i ^i- 

In any partition tree, we only need to consider sequential compositions of private counts along a path: if nodes u 
and V are not on the same root-to-leaf path, their node counts are independent of each other, so knowing the output of 
Ai, does not affect the privacy guarantee of Ay 

This outline provides a simple PSD. However, we have developed several optimization techniques which can 
significantly improve the utility of PSDs, which is the technical focus of this paper Section]?] shows how to choose 
the values e,-, while Section]5]shows how to process the noisy counts to maximize query accuracy. These results apply 
to all spatial decompositions. For data-dependent and hybrid trees, to guarantee differential privacy, the structure of 
the tree itself must also be perturbed by noise. We study this issue in Section|6]for kd-trees. We also consider building 
private Hilbert R-trees, which are binary trees (i.e., one-dimensional kd-trees) in the Hilbert space. When mapping 
back to the original space, we use the bounding boxes of Hilbert values corresponding to tree nodes to define the node 
rectangles (thus preserving privacy). For brevity, we do not explicitly discuss Hilbert R-trees further, although we 
include them in our experimental evaluation in Section|8] 

4 Allocating Noise Parameters 

In this section, we focus on how to choose noise parameters £, such that the composition rule is satisfied for all tree 
paths (see Section ]33] l. Let h denote the height of the tree; leaves have level and the root has level h. We assume that 
all nodes at level / have the same Laplace parameter e, (other choices are discussed at the end of this section). Hence, 
given a total privacy "budget" of e, we need to specify < / < h, such that YJI^o ~ ^- refer to a choice of e,'s 
as a budget strategy. The goal is to minimize the resulting query errors. 

Error measure. For any query Q, let Q denote the answer to Q computed over the private tree. Then 2 is a random 
variable which is an unbiased estimator of the true answer (since noise has mean 0). Its variance Yar{Q) is a strong 
indicator of query accuracy. As in prior work |25jil5J, we define the error measure to be Err(Q) = Yai{Q). The error 
of a query workload Qu--,Qsi& LLi Err(g,)/«- 

4.1 Query Processing 

Unlike a tree over the original (unperturbed) counts, a PSD may return many different results to a query Q. 

Query processing example. Figure [T] shows a possible processing for query Q, which sums the noisy counts in nodes 
b, u5, ul, u9, ulO and ul3. The answer is 2. However, if we replace b's count by the sum of its children's counts, and 
the sum of u5 and u7 by the difference between c's count and the sum of m6, m8, the answer becomes 8. This is because 
the noise is independent, and there are multiple ways of representing Q as a union or difference of node rectangles. 
Adding/subtracting corresponding noisy counts yields different results. ■ 

Consequently, to analyze Err(2), we must first describe a standard method for computing Q. Let Y be the set of 
noisy counts, and let U be the set of nodes used to answer Q. Then Err(2) = Y.ueu ^'^{Yu), i.e., the total variance is the 
sum of the node variances. Thus the error grows, to a first approximation, with the number of noisy counts included. 
We adopt the canonical range query processing method 15], which minimizes the number of added counts. 

The method is as follows: Starting from the root, visit all nodes u whose corresponding rectangle intersects Q. If u 
is fully contained in Q, add the noisy count Fj, to the answer Q; otherwise, recurse on the children of u, until the leaves 
are reached. If a leaf a intersects Q but is not contained in Q, use a uniformity assumption to estimate what fraction of 
Ya should be added to Q. 

Let n{Q) be the number of nodes that contribute their counts to Q. For each < i < h, let n, be the number of 
nodes at level ; that are maximally contained in Q (i.e. contributed their counts to Q according to the method above), 
so n{Q) = Lf=o"i- The following result bounds each n, and will guide us in choosing noise parameters 
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Lemma 2 Let Q be an arbitrary range query, and T be a spatial decomposition of height h in two dimensions. Then 
(i) IfT is a quadtree, n,- < 8 • 2''^' and 

<8(2''+i-l) =0(4''/2). 

(ii) IfT is a kd-tree, ni<S- 2L(/'-'+l)/2J and 

n{Q) < 8(2L(''+i)/2J+i _ 1) ^ (9(2V2). 

Proof. Let / denote the fanout of T: f — 4- for the quadtree and / = 2 for the kd-tree. Consider the left vertical extent 
of Q. If T is a quadtree, then for each rectangle at level / intersected by this vertical line, there are 2 rectangles at level 
i — 1 intersected by it. If T is a kd-tree, then for each rectangle that the vertical extent intersects at level / (assume wlog 
that level / corresponds to a vertical split), it intersects 1 at level / — 1 and 2 at level / — 2. In either case, the number of 
rectangles intersected by the left extent of Q grows by at most a factor / every two levels. 

We repeat this argument for the other three extents of Q. Additionally, the number of nodes that are maximally 
contained in Q can be bounded by the number of nodes that are intersected by Q, and the recurrence for n; follows. ■ 

We remark that a similar result extends to d dimensional decompositions, where the behavior is n(Q) ^ O ^ V'') ) 
for a tree of height h and fanout /. 



4.2 Budget Strategies 

The variance of the Laplace mechanism with parameter e, is Var(Lap(e,)) = 2/ef. Since the noise is independently 
generated in each node, we deduce that 

Err(e)=lto2«,/e?. (D 

For simplicity, we analyze strategies only for quadtrees; hence, n,- < 8 • 2''^' by Lemma[2|i). The analysis for kd-trees 
is similar. 

Uniform Budget: A natural strategy is to set e, = e/{h +1). Prior work that finds counts in trees (e.g. 1.1 1 J ) has used 
this model. Then Err(e) = ^^^lto«/ < ^(/! + 1)^(2''+^ - 1). 

Geometric Budget: We can significantly improve the query accuracy by considering a non-uniform budgeting strat- 
egy. Substituting the bound for n,- into ([T]) we obtain the following optimization problem to minimize the upper bound. 

Minimize Y^'l^^l''-' / ef 
Subject to: ^ ''^g £i' = £ • 



Lemma 3 An uppe r bound for En{Q) is 



£2(^-1)3 - £2 



which is attained for £,■ — 2''' ')/^£ ^^i^y^ ^ - 
Proof. By the Cauchy-Schwarz inequality, we have 



£,■)(!: 2'-7£2)>(f;^£,2"-7£2)2, 

=0 1=0 1=0 



with equality attained if and only if £, — C2'' ' /£f,yi, where C is a constant. Hence, £, = \fC2^^ Plugging this 
into L?=o = we obtain \JC = jl^jp-^- Then Err(2) is at most 

16X;2''-7(c2/322(''-')/3) = l6C-2/3^2(''-')/3 = 16£/C, 

1=0 i=0 
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Figure 2: Worst case behavior of Err(2) for uniform and geometric noise. 



and the result follows. ■ 

Lemma [3] shows that query accuracy is improved with a geometric budgeting scheme: starting from the root, the 
budget increases geometrically (by a factor of 2'/^), so the leaf counts are reported with highest accuracy. 

Studying the bound in the lemma, it seems that we should reduce h to reduce the noise. But this variance only 
bounds the error from noisy counts. We also have to account for the error due to queries which partly intersect some 
leaves, i.e., errors arising from the uniformity assumption. In the worst case, this error is proportional to the number 
of points in each leaf intersected by the query. We intersect «/, w 2^ leaves. On the average for an input of n points, we 
could have 0{n/4'') points per leaf (assuming balanced leaf counts). Hence, the uniformity assumption error behaves 
as (9(2''n/4'') — 0{n/2^). Then the overall error grows as (9(n/2'' +2''/^), suggesting we benefit more overall as h 
increases. 

Comparing strategies. Let 2 be a query that includes the maximum number of counts at each level, i.e., n, = 8 • 2''^' 
Figure 2 shows that worst case error for a uniform budget, Errunif(/i) = {h + 1)2(2''+' - 1) grows much faster than the 

geometric budget error, ¥jn:„^Q^[h) — - — 3^= — r^. This comparison assumes that Q touches a maximal number of 

nodes at each level. While in practice queries may touch fewer nodes, our experiments show that uniform noise is still 
significantly less accurate for a large number of different queries. 

Other budget strategies. There is a large class of strategies for dividing e along a path. For example, we could build 
a quadtree down to depth h and set Eh — E, i.e., allocate the entire budget to leaves (this approach is used in [ 12 1). In 
this case, queries are computed over the grid defined by the leaf regions and the hierarchical structure of the tree is 
irrelevant. Or we could conserve the budget by setting e, = for some levels /, and releasing no counts for those levels. 
Queries then use counts from descendant nodes instead. Conceptually, this is equivalent to increasing the fanout of 
nodes in the tree. 

Finally, we note that we do not have to use the same £,■ for all nodes on level /. For a generic query workload, 
this approach makes sense, as each node on level / is equally likely to be touched by the workload. However, if the 
workload is known a priori, one should analyze it to determine how frequently each node in the tree contributes to the 
answers. Then e„ could be larger for the nodes u that contribute more frequently, subject to the sum along each path 
remaining e. 



5 Optimizing Query Accuracy 

In the previous section we discussed how using geometric noise in a hierarchical decomposition can significantly 
improve the accuracy of queries. We now show that it is possible to further improve query accuracy via a post- 

'More rigorously, n, = min{8 ■ 2''^', 4*^'}, but this does not alter the analysis significantly. 



7 



processing of the noisy counts. The goal is to compute a new set of counts for which query errors are minimized. 
Note that this does not affect the privacy guarantee, as our method takes as input the output of the differentially private 
mechanism. 

Post-processing example. Consider a simple tree with one root a and four children, b, c, d and e. Let Yy denote 
the noisy count of node v e {a,b,c,d,e}, and assume first that we used uniform noise of e/2 on each count. There 
are two natural estimates of the true count of the root: itself, and the sum of counts of leaves, Yb + Yc + Yd +Ye. 
A first attempt is to estimate jS^ = Ya/2 + (F^ + Yc + Yci + Yg) /2, i.e., the average of the two estimates. In this case, 
Var(j3„) = Vai-(y„)/4 + 4Var(yft)/4 = (5/4) Var(F„), worse than directly using y„. But it turns out we can do better: 
setting j3„ =4y„/5 + + y, + F,/ + }',)/5 yields Var(i3„) = (4/5) Vai-(y„). 

For any non-uniform budgeting scheme, if the budget for a is Z\ and the budget of its children is eo, then jSa = 

4^2+^2 + 5^2^ {Yh + Yc + Yd + Ye) improves accuracy: one can derive Var(j3„) = " Var(y„) . ■ 

As this example illustrates, there are many possible ways of obtaining new counts as linear combinations of the 
published noisy counts. The choices increase exponentially with the tree size, as we can combine counts of ancestors, 
descendants, siblings etc. in various ways. Luckily, we do not need to explore them in order to compute the best 
solution. From linear statistical inference, we know that the best set of new counts is obtained via the ordinary least- 
squares estimate (OLS) f2T|. In general, computing the OLS for n unknowns requires solving a linear system with 
n X « matrices. Rather than explicitly inverting such (large) matrices, we present a linear time algorithm that achieves 
the same result, by taking advantage of the inherent symmetries of the matrices defined for the tree structure. Prior 
work considered the simpler case of uniform noise parameters 1 1 1 1. We give a solution to the more general case when 
all nodes at level / have the same Laplace parameter e, (this encompasses both uniform and geometric budgeting). The 
generalization requires some technical effort to prove, but yields a simple algorithm. 

Notation: As before, let / denote the fanout of the spatial index and let h denote its height. We use h{v) to denote the 
height of node v: h{v) = if v is a leaf, and h{mot) = h. We assume that the tree is complete, i.e., all paths have length 
h and all internal nodes have fanout /. Let m ^ v denote that m is a leaf in the subtree of v. Let anc(M) be the set of all 
ancestors of u, including node u. We use par(M) and child(M) to denote the parent, resp. the set of children, of node u. 
For linear inference over a tree structure, the definition of an OLS is as follows. 

Definition 3 Let Y denote the vector of original noisy counts, i.e., F,, is the noisy count of node v. Let Ey denote the 
noise parameter for node v. We denote by j3 the vector of counts after post-processing. Then ji is the ordinary least 
squares estimator (OLS) if it is consistent, i.e., jSy = L«Gchiid(v) Pufor all nodes v, and it minimizes Y^v^viYv ~ AO^- 

The OLS j3 has two powerful properties: It is unbiased for any query Q (recall that Y is also unbiased, since noise 
has mean 0). Most importantly, among all unbiased linear estimators derived from Y, it achieves minimum error for 
all range queries. In particular, it achieves smaller error than Y. 

The computation of )3 is based on the following result: 

Lemma 4 For any node v in the spatial index, the following recurrence holds (with Y and f} as in Definition^: 

^ ;=0 / wGanc(v)\{v} «^v"n'Ganc(t/) 

Proof. We use the notation defined in Section |5] In addition, we define for each node u the set desc(M) to be the 
descendants of u, including node u. We also use the shorthand anc(M, v) — anc(M) n anc(v). Let C„ denote the true 
count of node u. We want to show that if j3 is the OLS for C, then it satisfies the recurrence in the statement of the 
lemma. For any v 

y„ = ^c,.+Xy (3) 

where is the random variable drawn from the Laplace distribution with parameter £/,(,,-) . Therefore Var(y,,) — 2 /E/j^,,) . 

The set of equations (|3]l for all v can be written in matrix form as follows. Let n = be the number of leaves, and let 
m be the number of nodes in the tree. The binary matrix //,„x« representing the tree hierarchy is defined as //„ = 1 if 
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V ^ M and otherwise. Assume wlog that the nodes u are in breadth-first order, and the leaves are ordered from left to 
right. Then ^ can be written asY — H C+X, and the covariance matrix of Y is cov(F) — 2diag(l/e^^jj^j). 
We apply the standard transformations 

Z= (cov(F)/2)"ZF =diag(£,,(„))F and f/ = {cov{Y)/2)-^H 

to obtain the equation Z = U -C + diag(e/,(„)) X. Now cov(Z) = 21, where / the unit matrix, and the new equation fits 
the classical model. 

Since the OLS is consistent, i.e., j3,. := Li,^vj3„, it suffices to estimate the leaf counts, and the other estimates 
follow. A vector j3 is the OLS for the true count C if it minimizes (Z-UpyiZ-Up) (equivalent to Definition [s}. 
After differentiating, we have that j3 satisfies 

U'^Up = U^Z (4) 

By simple calculations, {U^U),,^. = IvGanc(H,H.) e^^(v) and (f/^Z)„ = L.-e^nclu) £h{v)Zv = 'Lveanc{u)£h(v/^- 
node V, we sum the corresponding rows on the left side of ^ to obtain Y.u~^v{U^U)uP = 



=EE E E EE^Mw)^^ 

z^yi(-<VH.-Ganc(«,z)\anc(v) n.'Ganc(v) u^vz^w 

-E E E ^k^)L(lp) 

«-<i'».Ganc(H)\anc(v)z-;K' wGanc(v) yz-dv / 

= E E E 

w-<vw£anc{H)\anc(v) ii''Ganc{v) 

= l'<^1L E a. E 

7=0 M^vn,ganc(n):/!(n')=./ n'Ganc(v) 

='E'e' E E %'(v.)A.' 

j=0 n'Ganc(v')\{v} 

= E^'/'A.+/'"^'' E ^M.)^'^- 

j=0 vt'Ganc(v)\{y} 

We now sum the corresponding rows on the right hand side of Q, i.e., we sum (f/^Z)„ over u -< v. Equating the 
left and right hand side, we obtain (|2]). ■ 

Lemma [4] provides an efficient way to compute j3. First, pre-compute the following array E of h + 1 entries: 
El = L;=o/''£y ■ Since £/ = £/_! +f'ef, this takes time 0{h). For any node v, define Z,, = LM-<vLwGanc(M) £A(»y)^vv We 
compute Z,, for all nodes v in two hnear traversals of the tree, as follows; 
Phase I: Top-down traversal 

We compute Z,, for all leaves v. Note that in that case, Z,, = LwGanc(v) £/?(h')^»'- ^"mt = sf^Yroot- In a top-down 
traversal of the tree, compute for each node u: a„ = Ctparfu) + ^h(u)^"- When we reach a leaf v, we set Z,, = a,,. 
Phase II: Bottom-up traversal 

We compute Z,, for all internal nodes v as Z^ = L„Gchiid(i )-Zi(; this requires a single bottom-up traversal. 
Phase III: Top-down traversal 

We now compute jSy for all nodes v. While doing so, we also compute an auxiliary value F,,, defined as F,, = 
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)\{v} Pw£u^.) ■ Note that Equation (|2| for v = root is: 

So we compute j3,-„o, = Zroot/Ek. In addition, let F,of,t = 0. For any node v 7^ roof, assume we have already 
computed j3,y and F^y for all w e anc(v) \ {v}. Then we compute Fy — fpar(v) + Ppar(v)£h{v)+i' From Equation (|2]), we 
find 

_ ~ /^(-) I.eanc(v)\{v} _ Z., - f^Fy 

From Lemma |4] and the above description we conclude: 

Theorem 5 The algorithm consisting of Phases (I)-(III) described above computes the OLS estimator in time linear 
in the size of the tree. 



In Section 8.2 we conduct an experimental evaluation which supports our theoretical result, and shows significant 
improvement in query accuracy using the OLS. 



6 Data-Dependent and Hybrid Trees 



As noted in Section [X2| publishing data-dependent or hybrid trees with privacy guarantees must overcome an addi- 
tional challenge: the tree structure itself could reveal private information. In most cases, the current node is split via 
a line through the current median value along some axis. For privacy reasons, we cannot reveal the exact median. In 



Section 6. 1 we discuss methods for computing a private median for a set of points. 

Extending the computation of private medians to a hierarchical structure requires some subtlety. Note that, unlike 
private counts, it is no longer the case that only root-to-leaf compositions matter. If a tuple is deleted from the dataset, 
it affects the (true) medians not only on its respective path, but also for nodes in different parts of the tree. To show the 
privacy of our approach, we appeal to the fact that the composition of differentially private outputs is well-understood 
in an interactive model, where a user asks a series of queries. We can imagine the following interaction: At the root 
level, our private algorithm A outputs a noisy median value mi . A user subsequently asks for the median of the points 
lying on one side of mi, and for the median of the points lying on the other side of mi (hence, mi becomes part of 
the user query). Algorithm A returns two noisy values, m2 and m^, each computed with respect to the user-specified 
subset. Hence, the computation can continue recursively, and it is again sufficient to consider sequential compositions 
only along root-to-leaf paths. Now observe that it is straightforward for the data owner to play both roles, and output 
the final result (without interaction) as the PSD. 



6.1 Private Medians 

We now outline several methods for computing a private median for a given set of points. These approaches are either 
implicitly or explicitly developed in prior work, but we are not aware of any previous comparison. Here, we bring 
together the different approaches, and give some new analysis on their behavior We compare them empirically in 
Section |8] 

Let C — {xi, X,,} be a (multi)set of n values in non-decreasing order in some domain range [lo,hi] of size 
hi — lo ^ M, and let x,n be its median value. We wish to compute a private median for C. We could apply the 
Laplace mechanism (Definition |2|i, i.e., return C{C) = x,„ + X, where X is Laplace noise. However, the sensitivity 
of the median is of the same order of magnitude as the range M (see ||20l [121 ). and the noise value X usually dwarfs 
x,n- A consequence of this is that the noisy median is frequently outside the range [lo,hi]. When used in a kd-tree 
construction, such a noisy median does not help to divide the data. Instead, we study the following four methods. 
Smooth Sensitivity 1,20,1 tailors the noise to be more specific to the set C. However, smooth sensitivity has slightly 
weaker privacy guarantees than the Laplace mechanism: it only satisfies so-called (e, 5) -differential privacy; see EOl 
for details. 
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Definition 4 (from Claim 3.3 ofHWj) Let < e, 5 < 1, and let ^ = jpipj^^/gy) ■ The smooth sensitivity of the median 
is defined as 

CTi (median) = max (e^*^^ max {x,n+t — Xm+t-k-i))- 

0<k<n 0<t<k+l 

(where we define x; :— lo if i < and Xj :— hi ifi > n). 

The smooth sensitivity mechanism for median is defined as SS(C) — x„, + ^ X where X is a random variable 
drawn from the Laplace distribution with parameter 1 and (7s = (median). 

Exponential Meclianism is a general method proposed in |fT9l as an alternative to the Laplace mechanism: instead 
of adding random noise to the true value, the output value is drawn from a probability distribution over all possible 
outputs, so that the differential privacy condition is satisfied. Applied to median, we obtain the following algorithm: 

Definition 5 For any x € [lo,hi], let rank(x) denote the rank of x in C. The exponential mechanism EM returns x with 

Pr[EM(C) =x] cx g-f |rankW-rank(x»,)l. 

Since all values x between two consecutive values in C have the same rank, they are equally likely to be chosen. 
Thus, EM can be implemented efficiently by observing that it chooses an output from the interval 4 — [xkjXk+i) with 
probability proportional to |4|e^2l'^^'"l. Conditional on 4 being chosen in the first step, algorithm EM then returns a 
uniform random value in 4- 

Cell-based Method is a heuristic proposed in f26\. It imposes a fixed resolution grid over C then computes the median 
based on the noisy counts in the grid cells. When applied to a hierarchical decomposition, a fixed grid is computed 
over the entire data, then medians are computed from the subset of grid cells in each node. Cell counts have sensitivity 
1 . The accuracy of this method depends on the coarseness of the grid relative to the data distribution. 

Noisy Mean is a heuristic from (VT], which replaces median with mean. A private mean can be computed privately 
by computing a noisy sum (with sensitivity M) and a noisy count (with sensitivity 1), and outputting their ratio. If the 
count is reasonably large, this is a fair approximation to the mean, though there is no guarantee that this is close to the 
median. 

Provided the data is not too skewed, smooth sensitivity and exponential mechanism have constant probability of 
choosing a good split; i.e., the noisy median has a constant fraction of the data points on each side. We say that the data 
is not too skewed if it obeys the "80/20 rule": the central 80% portion of the data covers at least 20% of the entire data 
range M (hence, it's not too concentrated). Formally, (x4„/5 — x„/5) > M/5. For smooth sensitivity, we also require n 
to be large enough that ^n is at least a small constant (where ^ is as in Definition|4|. 

Lemma 6 Let C be such that X4„/5 — x„/5 >M/5. 

(/jPr[SS(C) e [x„/5,X4„/5]|«^ > 4.03] > 0.5(1 -e-^/4); 
(njPr[EM(C)e[x„/5,X4„/5]]>i. 

Proof (i) 

Observe that < max( max e^'^^{x„,>k+i—Xm-k-i)-, max e^'^^ {x,„+k+\ ~ ^m-k-i)) 

0<k<2n/5 2n/5<k<n 

< max((x4„/5 -x„/5),£'-2^"/^M) 

Using our assumptions, e^^^"I^M <M/5 < {x4„/s ^Xn/5)i so C7j < (x4„/5 — x^/j). Consider the case when the median 
is closer to x„/5 than to X4„^^. Then the output of SS is within the desired range if < {2as/e)X < (7j/2, i.e., X < e/4. 
Symmetrically, if the median is closer to X4„/5 , the noisy median is in the desired range if > {2as/£)X > —as/2, i.e., 
X > — e/4. Since X is drawn from a symmetric distribution, we conclude that 

Pr[SS(C) e [x„/5,X4„/5]|^n > 4.03] > Pr[0 <X < e/4] = 

^This computation is implicit in McSheiry's PINQ system 1171 . 
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(ii) Let a be the proportionality constant impUcit in Definition js] Let E be the event that EM(C) € [xn/SjMn/s]- 
Then 

and 1 - Pr[£'] = Pr[^£'] < ae-°-^^''"M. 
Using our assumption, we have 1— Pr[£'] <5Pr[£']. Hence Pr[£'] > 1/6. ■ 

6.2 Balancing Privacy Budgets 

As discussed before, the privacy guarantee of a data-dependent tree is obtained by composing the individual privacy 
guarantees for medians and counts along each root-to-leaf path, as in Lemma[T| 

Let h be the height of the tree and let A"\ 1 < i < h,he the noisy median algorithms corresponding to the internal 
nodes on a path, such that Af is e-"-differentially private. Let A'j,0 <i < h, be the noisy count algorithms correspond- 
ing to the same nodes, plus the leaf, such that is e?-differentially private. Then the PSD released is £-differentially 
private, where 

h h 

i=\ i=0 

(assuming that sums are equal along each path. Else, e is the maximum sum along any path.) 

Median vs. count noise. In our experiments, we consider separate allocations of the e budget for the median compu- 
tation, and for the count computation. An important consideration is that larger values of yield smaller noise (in a 
probabilistic sense). However, the noise magnitude has different consequences on the overall accuracy of the tree, for 
median vs. count. Large count noise gives more uncertainty on the count of each region in the tree which Q touches. 
By contrast, large median noise may result in a skewed split in some internal node u. Hence, the children of u are 
unbalanced in terms of number of points in their leaves, as well as the size of regions they represent. However, queries 
that touch the descendants of u may still perform well, provided the respective noisy counts are accurate enough. Still, 
we cannot neglect median finding entirely. If the median does not divide the current point set so that there are at least 
a constant fraction of points on each side, then we are essentially wasting a level of the tree. This becomes more of a 
challenge deeper in the tree, as the point set sizes shrink. 
Let 

£median — Y!i=i ^i" ^nd Ccount — I^'Lq^i'- Study different strategies for choosing these values such that 
£median + Ecount = £, by analyzing their impact on various query loads. 

Budgeting median noise. Once emedian is fixed, we must distribute it among internal nodes. A simple strategy is 
uniform budgeting, i.e., £■" = emedian//^- The hybrid tree approach, which switches to quadtree splitting after £ levels 
implies setting e"' = Emedian/^ for h — £ < i < h and = for </</; — ^. 

Flattening the kd-tree. From Lemma |2] and the computation of query errors from Section |4j it seems that a (2D) 
kd-tree built over the same number of leaves as a quadtree has significantly worse accuracy. The reason is that the 
kd-tree has twice the height of the quadtree, so the noise budget is divided into twice as many pieces. To better 
compare quadtrees and kd-trees, we can ensure that both have the same fanout of 4. We propose "flattening" the 
kd-tree: connect the root to its four grandchildren, and recurse down the tree, skipping every other level. (This is akin 
to setting e,- — for every other level). The bound on n{Q) over a flattened kd-tree is now the same as for a quadtree 
of the same height, following from Lemma[2] From now on, we assume all trees have fanout / = 4. 

7 Further Enhancements 

In this section we discuss how to apply two popular strategies — sampling and pruning — in the context of differential 
privacy. Sampling can be used to significantly improve the running time for computing data-dependent trees. Pruning 
can improve the query accuracy for both data-independent and data-dependent trees. 

Sampling to reduce computation time. When data is large, computing differentially private functions can be very 
time consuming, even when the functions take near-linear time. A natural technique is to compute the function on only 
a small sample of the full data to speed up the computation. Intuitively, samphng is very compatible with differential 
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Figure 3: Query accuracy of quadtree optimizations 



privacy: the influence of any individual is reduced, since they may not be included in the sample. The next result 
extends Kasiviswanathan et al. UlAVi 



Theorem 7 Given an algorithm A which provides e-differential privacy, and < p < I, including each element of 
the input into a sample S with probability p and outputting A{S) is 2pe^ -differentially private. 

Treating 2e^ as a constant (it is between 2 and 5.5 for < e < 1), it is sufficient to sample at a rate of « e'/lO to 
achieve e'-differential privacy. For large enough data, sampling at a rate of, say, 1% and applying Laplace noise with 
parameter 0.9 achieves 0.1 -differential privacy — ^but it processes a data set two orders of magnitude smaller. We use 



this result to compute noisy medians for data-dependent and hybrid trees, via the SS and EM methods from Section 6. 1 
We conclude that sampling makes both methods an order of magnitude faster. It only marginally deteriorates the 
accuracy of the exponential mechanism, and it improves the accuracy of the smooth sensitivity-based noise. See 
Section lO 

Sampling is not useful for noisy counts: computing a sample (independently per node) and computing the exact 
count both require linear time. However, sampling introduces more inaccuracy in the noisy count. For the same reason, 
the cell-based method for noisy medians (Section[6T|) is implemented without sampling. 



Pruning. Due to the uneven nature of the splitting process, it is possible that both data dependent and data independent 
trees contain some nodes with few or no points. Dividing such sets can be counter-productive: the noise added to the 
counts of their descendants may add up for queries which intersect their region. Instead, it is preferable to cut off the 
tree at this point. This requires some care: the choice to stop cannot be based on the "true" count of the points, as 
this does not meet the differential privacy requirement. Instead, we can choose to stop if the noisy count estimated 
for a node in the tree is less than threshold m. In our setting, we choose to apply this pruning after the postprocessing 
procedure of Section[5] which operates on a complete tree. We observed that this improves empirically over retaining 
all nodes. 



8 Experimental Study 

We begin by studying design choices for each class of methods: the choice of medians for data-dependent PSDs, 
different strategies for budgeting and post-processing for data-independent PSDs. We then study the effectiveness of 
query answering for the different approaches. Finally, we apply our findings to a particular application lfT2l . 

8.1 Experimental Environment 

To evaluate the accuracy of query answering of various PSDs, we experimented on a mixture of real and synthetic 
data. Real data was drawn from the 2006 TIGER/Line dataserl Here, we present results using GPS coordinates of 



'via the exposition at|http : / /adamdsmith . word 


press . com/200 9/0 9/02/ sample- secrecy/ 


' http : / /www . census . gov/ geo/www/ tiger/ 
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Figure 4: Private medians accuracy and efficiency 



road intersections in the states of Washington and New Mexico. This data represents a rather skewed distribution 
corresponding roughly to human activity, so we treat it as the locations of individuals which should be kept private. 
The dataset has 1.63 million coordinates in the range [—124.82,-103.00] x [31.33,49.00]. We conducted experiments 
on other data sets as well, including synthetic 2D data with various distributions and TIGER/Line data of different 
regions, and obtained similar results. 

We show results for rectangular queries where query sizes are expressed in terms of the original data. For example, 
since 1 degree is approximately 70 miles, (15, 0.2) indicates a "skinny" query of 1050 x 14 miles. We consider several 
query shapes; for each shape we generate 600 queries that have a non-zero answer, and record the median relative error 

All experiments were conducted on a 2.80GHz CPU with 8GB RAM, so the data fits easily in memory. We 
implemented our PSDs in Python 2.6 with scientific package Numpy. 



8.2 Experimental Results 

Budget and post-processing. In Sections |4]and|5]we proposed two techniques to improve query accuracy for PSDs: 
geometric noise allocation, and the post-processing method. We compare quadtrees with geometric budgeting (quad- 
geo), post-processing (quad-post) and both combined (quad-opt) to the baseline approach of uniform budget with no 
post-processing (quad-baseline). Figure[3]shows the relative errors when all trees are grown to the same height, h — 10. 
Clearly, each optimization significantly improves the query accuracy, and in combination they perform even better: 
the relative error is reduced by up to an order of magnitude, especially when the privacy budget is limited (e — 0.1). 
We observe the same improvement on other PSDs, so all subsequent results are presented with both optimizations. 



Quality of private medians. In Section 6.1 we described several approaches to private medians: smooth sensitivity 
(SS), exponential mechanism (EM), noisy mean (NM) and the cell-based approach (cell). To show their relative 
behavior, we compare them on a synthetic one-dimensional dataset with 2^" data points distributed uniformly within 
a domain of [0,2^^] (the same relative performance was seen on other distributions). We build a binary tree structure 
with splits found by each mechanism, and measure the average (normalized) rank error of the private medians for each 
level. In the worst case the noisy median may fall out of the data range [jci ,x„] and have 100% relative error 

Figure[4]shows the accuracy and time for median finding with privacy budget e — 0.01 at each level (counting from 
the root down). For SS, which is (e, 5) -differentially private, we set 5 = 10^^. We also show the result of combining 
sampling with probability p = 1% with SS and EM, obtaining methods SS, and EMs respectively. 

We argued that efficiency could be improved by combining either method with sampling. The exponential mech- 
anism (EM) is always the most accurate, providing almost true medians for large data sizes (2^*^ points at depth 0) 
and medians with moderate error for smaller data sizes (around 2'^ points at depth 9) without sampling. For smooth 
sensitivity, the sampling (SSs) approach improves the accuracy, since the increase in privacy budget outweighs the 
effect of fewer data points. The reverse is true for EMs: although the privacy budget becomes about 50 times larger. 
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Figure 5: Query accuracy for kd-trees 
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Figure 6: Query accuracy comparison 



the ranks are about 100 times smaller, so the probability associated with elements far from the median is correspond- 
ingly larger. Nevertheless, as Figure 4(b) shows, there is a commensurate performance improvement: the sampling 
versions are about an order of magnitude faster across all data sizes. Although fast, the noisy mean approximation 
(NM) gives poor quality medians for smaller data sizes. The cell-based approach (cell length 2'") has larger error for 
large data sizes while being much slower than others. Based on these results, we use the exponential mechanism as 
the default method for noisy medians in all other experiments, and advocate the use of sampling when the data under 
consideration is large. 

Other parameter settings. We summarize our findings on parameter settings. Detailed plots are omitted for brevity. 

Hybrid trees. A hybrid tree switches from data dependent splitting to data independent splitting at a "switch level" i. 
We found that switching about half-way down the tree (height 3 or 4) gives the best result over this data set. 

Median vs. count noise. As we increase Emedian and decrease Ecount, we obtain more accurate divisions of the data, but 
less accurate counts. Although it may vary for different queries and PSDs, in most cases the best results were seen 
when budget was biased towards the node counts, allocated roughly as £count — 0.7£ and £rnedian — 0.3£, SO we adopt 
this division. 

Hilbert curve resolution. Recall that we can treat Hilbert R-trees as one-dimensional kd-trees, and so obtain PSDs. 
First we set the order of the Hilbert curve. Given the large domain and skewed data distribution, a Hilbert curve of 
order 23 is needed to differentiate each individual point in the dataset. However, since the leaves of our PSDs should 
contain several points, we found similar accuracy for all resolutions in the range 16 to 24, and use curves of order 18 
in our experiments. 

Comparison of kd-trees. We compare the accuracy of queries among the kd-tree with EM medians {kd- standard), 
the hybrid kd-tree {kd-hybrid), the cell -based kd-tree proposed in |26| {kd-cell) with cell length 0.01 and the noisy 
mean based kd-tree from lfT2i {kd-noisymean). To see "the cost of privacy", we also include some results on non- 
private structures: the exact kd-tree {kd-pure) and the kd-tree which uses exact medians but noisy counts {kd-true). In 
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Figure 7: Time efficiency and results on private record matching 



all experiments, the kd-trees have the same (pre-determined) maximum number of levels h — 8 (with fanout 4). We 
also use a pruning condition based on frequency: we remove descendants of any node whose noisy count is below a 
threshold m — 32. 

Figure |5] shows the result of comparing kd-tree variants across a range of privacy budgets (£ = 0.1,0.5, 1.0). We 
first observe that even a pure kd-tree without noise of any kind introduces some error because of the uniformity 
assumption for leaves that intersect, but are not contained in the query. Comparing kd-true and kd-pure, we see that 
adding noise to counts does not dramatically degrade query accuracy: it remains below 1% relative error Rather, it is 
the noise in the choice of medians which seems to have the greater impact on query accuracy, as is evident from the 
results for the PSDs. Since the splitting points do not evenly divide the data, particularly at greater depths in the tree, 
there are some leaves with a much larger number of points than in the exact trees. This explains the weak performance 



of kd-noisymean, whose choice of splitting points was seen to be poor in Figure 4(a) 

The kd-cell approach has good performance on queries which are small and square. However, as the query area 
grows, the method looses its edge, and is dominated by kd-hybrid. Overall, the accuracy of query answering from all 
cases is good, with small relative errors of less than 10% even for high privacy (e — 0. 1). Of the kd-tree variations, the 
kd-hybrid seems the most reliably accurate. 

Comparison of PSDs. Figure |6] shows the query accuracy for the best performing instances of the representative 
methods: optimized quadtree, cell-based kd-tree, hybrid kd-tree and Hilbert R-tree on three queries of different shapes. 
In this set of experiments we varied the depth of the trees (h =6 to 1 1) while keeping the privacy budget fixed at e = 0.5. 
At depth 10 the optimized quadtree is able to provide the best accuracy of all. It seems that the ability to devote all 
of the privacy budget to noisy counts for quadtrees outweighs the ability of kd-trees to more evenly split the data via 
private medians. However, the hybrid kd-tree at height h = E obtains about the same accuracy as the quadtree at height 
h = 10 for the larger queries. The cell-based kd-tree, which first imposes a uniform grid over the domain, provides 
good accuracy on small square queries since the node shape matches the query shape. However, kd-cell performs worst 
on the larger queries, which require probing more nodes. We also note that the private Hilbert R-tree has comparably 
good performance on some queries, but much higher errors on others. 



Scalability. Figure 7(a) illustrates the time taken to build each spatial decomposition using our prototype code. For 
the data sets considered, PSD construction time was not a bottleneck, since we typically consider this to be a one-time 
cost. In general, the structures which only divide the domain take much less time than the data dependent approach, 
while hybrid kd-tree lies in-between, at around a minute. The cell-based kd-tree also takes a long time, due to first 
materializing noisy counts of many cells, then building a tree on top of them. The Hilbert R-tree takes even more time, 
due to the higher cost of encoding and decoding coordinates in Hilbert space. The count post-processing step adds 
only a few seconds to the overall cost. 
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8.3 Application to Private Record Matching. 



The kd-noisymean tree was originally introduced to solve a private record matching problem — see the full details 
in IIT2I . Here, a differentially private data structure is used to focus in on areas of potential overlap between datasets 
before an expensive secure multiparty computation (SMC) procedure is invoked. The metric of interest is the reduction 
ratio, which measures how much SMC work is saved relative to the baseline of no elimination, so bigger is better. 
We obtained the code from the authors of fT2l and ran the same experiments on the same data. Figure 7(b) shows 
reduction ratio for quad-baseline, kd-noisymean and kd-standard (in this application, all count budget is allocated 
to leaves and thus post-processing does not apply). We observe that the new kd-tree approach we have proposed 
can improve appreciably over the two methods tried in fvT\. Note that improving reduction ratio from 0.93 to 0.95 
represents 28% less SMC work. 



9 Concluding Remarks 

We have presented a comprehensive framework for differentially private spatial decompositions, and shown how 
to produce private versions of many popular methods, such as quadtrees, kd-trees and Hilbert R-trees. We have 
proposed novel techniques for setting hierarchical noise parameters in a non-uniform manner that minimizes query 
error. Further, we have developed a post-processing technique that re-computes node counts based on the initial noisy 
counts to optimize query accuracy. This new technique applies to a large class of other settings in privacy. 

In the process of computing private data-dependent decompositions, we have provided the first survey of techniques 
for computing private medians and derived theoretical results that give insight into the expected behavior of the two 
most accurate methods. We have shown how to combine all these results (plus other techniques such as sampling) 
into a single whole that achieves the desired privacy guarantee. In our ongoing work, we study other settings: sparse 
categorical data [3J, and higher dimensional data. 

Our experimental study is consistent with our theoretical insights, and shows that each of our techniques leads to 
significant improvements in query accuracy, running time or both. For most PSDs, we achieve relative query errors in 
single-digit percentages, while guaranteeing strong differential privacy. We conclude that PSDs represent an efficient 
and accurate way to release spatial data privately. 
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